180 research outputs found
PD2T: Person-specific Detection, Deformable Tracking
Face detection/alignment has reached a satisfactory state in static images captured under arbitrary conditions. Such methods typically perform (joint) fitting independently for each frame and are used in commercial applications; however in the majority of the real-world scenarios the dynamic scenes are of interest. Hence, we argue that generic fitting per frame is suboptimal (it discards the informative correlation of sequential frames) and propose to learn person-specific statistics from the video to improve the generic results. To that end, we introduce a meticulously studied pipeline, which we name PD\textsuperscript{2}T, that performs person-specific detection and landmark localisation. We carry out extensive experimentation with a diverse set of i) generic fitting results, ii) different objects (human faces, animal faces) that illustrate the powerful properties of our proposed pipeline and experimentally verify that PD\textsuperscript{2}T outperforms all the compared methods
A Comprehensive Performance Evaluation of Deformable Face Tracking "In-the-Wild"
Recently, technologies such as face detection, facial landmark localisation
and face recognition and verification have matured enough to provide effective
and efficient solutions for imagery captured under arbitrary conditions
(referred to as "in-the-wild"). This is partially attributed to the fact that
comprehensive "in-the-wild" benchmarks have been developed for face detection,
landmark localisation and recognition/verification. A very important technology
that has not been thoroughly evaluated yet is deformable face tracking
"in-the-wild". Until now, the performance has mainly been assessed
qualitatively by visually assessing the result of a deformable face tracking
technology on short videos. In this paper, we perform the first, to the best of
our knowledge, thorough evaluation of state-of-the-art deformable face tracking
pipelines using the recently introduced 300VW benchmark. We evaluate many
different architectures focusing mainly on the task of on-line deformable face
tracking. In particular, we compare the following general strategies: (a)
generic face detection plus generic facial landmark localisation, (b) generic
model free tracking plus generic facial landmark localisation, as well as (c)
hybrid approaches using state-of-the-art face detection, model free tracking
and facial landmark localisation technologies. Our evaluation reveals future
avenues for further research on the topic.Comment: E. Antonakos and P. Snape contributed equally and have joint second
authorshi
Motion deblurring of faces
Face analysis is a core part of computer vision, in which remarkable progress
has been observed in the past decades. Current methods achieve recognition and
tracking with invariance to fundamental modes of variation such as
illumination, 3D pose, expressions. Notwithstanding, a much less standing mode
of variation is motion deblurring, which however presents substantial
challenges in face analysis. Recent approaches either make oversimplifying
assumptions, e.g. in cases of joint optimization with other tasks, or fail to
preserve the highly structured shape/identity information. Therefore, we
propose a data-driven method that encourages identity preservation. The
proposed model includes two parallel streams (sub-networks): the first deblurs
the image, the second implicitly extracts and projects the identity of both the
sharp and the blurred image in similar subspaces. We devise a method for
creating realistic motion blur by averaging a variable number of frames to
train our model. The averaged images originate from a 2MF2 dataset with 10
million facial frames, which we introduce for the task. Considering deblurring
as an intermediate step, we utilize the deblurred outputs to conduct a thorough
experimentation on high-level face analysis tasks, i.e. landmark localization
and face verification. The experimental evaluation demonstrates the superiority
of our method
Unsupervised Controllable Generation with Self-Training
Recent generative adversarial networks (GANs) are able to generate impressive
photo-realistic images. However, controllable generation with GANs remains a
challenging research problem. Achieving controllable generation requires
semantically interpretable and disentangled factors of variation. It is
challenging to achieve this goal using simple fixed distributions such as
Gaussian distribution. Instead, we propose an unsupervised framework to learn a
distribution of latent codes that control the generator through self-training.
Self-training provides an iterative feedback in the GAN training, from the
discriminator to the generator, and progressively improves the proposal of the
latent codes as training proceeds. The latent codes are sampled from a latent
variable model that is learned in the feature space of the discriminator. We
consider a normalized independent component analysis model and learn its
parameters through tensor factorization of the higher-order moments. Our
framework exhibits better disentanglement compared to other variants such as
the variational autoencoder, and is able to discover semantically meaningful
latent codes without any supervision. We demonstrate empirically on both cars
and faces datasets that each group of elements in the learned code controls a
mode of variation with a semantic meaning, e.g. pose or background change. We
also demonstrate with quantitative metrics that our method generates better
results compared to other approaches
On the Convergence of Encoder-only Shallow Transformers
In this paper, we aim to build the global convergence theory of encoder-only
shallow Transformers under a realistic setting from the perspective of
architectures, initialization, and scaling under a finite width regime. The
difficulty lies in how to tackle the softmax in self-attention mechanism, the
core ingredient of Transformer. In particular, we diagnose the scaling scheme,
carefully tackle the input/output of softmax, and prove that quadratic
overparameterization is sufficient for global convergence of our shallow
Transformers under commonly-used He/LeCun initialization in practice. Besides,
neural tangent kernel (NTK) based analysis is also given, which facilitates a
comprehensive comparison. Our theory demonstrates the separation on the
importance of different scaling schemes and initialization. We believe our
results can pave the way for a better understanding of modern Transformers,
particularly on training dynamics
Unsupervised Controllable Generation with Self-Training
Recent generative adversarial networks (GANs) are able to generate impressive photo-realistic images. However, controllable generation with GANs remains a challenging research problem. Achieving controllable generation requires semantically interpretable and disentangled factors of variation. It is challenging to achieve this goal using simple fixed distributions such as Gaussian distribution. Instead, we propose an unsupervised framework to learn a distribution of latent codes that control the generator through self-training. Self-training provides an iterative feedback in the GAN training, from the discriminator to the generator, and progressively improves the proposal of the latent codes as training proceeds. The latent codes are sampled from a latent variable model that is learned in the feature space of the discriminator. We consider a normalized independent component analysis model and learn its parameters through tensor factorization of the higher-order moments. Our framework exhibits better disentanglement compared to other variants such as the variational autoencoder, and is able to discover semantically meaningful latent codes without any supervision. We demonstrate empirically on both cars and faces datasets that each group of elements in the learned code controls a mode of variation with a semantic meaning, e.g. pose or background change. We also demonstrate with quantitative metrics that our method generates better results compared to other approaches
The 3D Menpo Facial Landmark Tracking Challenge
This is the final version of the article. It is the open access version, provided by the Computer Vision Foundation. Except for the watermark, it is identical to the IEEE published version. Available from IEEE via the DOI in this record.Test descriptionRecently, deformable face alignment is synonymous to the task of locating a set of 2D sparse landmarks in intensity images. Currently, discriminatively trained Deep Convolutional Neural Networks (DCNNs) are the state-of-the-art in the task of face alignment. DCNNs exploit large amount of high quality annotations that emerged the last few years. Nevertheless, the provided 2D annotations rarely capture the 3D structure of the face (this is especially evident in the facial boundary). That is, the annotations neither provide an estimate of the depth nor correspond to the 2D projections of the 3D facial structure. This paper summarises our efforts to develop (a) a very large database suitable to be used to train 3D face alignment algorithms in images captured "in-the-wild" and (b) to train and evaluate new methods for 3D face landmark tracking. Finally, we report the results of the first challenge in 3D face tracking "in-the-wild".The work of S. Zafeiriou and A. Roussos has been partially funded by the EPSRC Project EP/N007743/
- …